string->utf8, string->utf16, string->utf32, utf8->string, utf16->string, utf32->string - convert between strings and bytevectors

LIBRARY

(import (rnrs))                     ;R6RS
(import (rnrs bytevectors))         ;R6RS
(import (scheme base))              ;R7RS

SYNOPSIS

(string->utf8 string)
(string->utf8 string start)           ;R7RS
(string->utf8 string start end)       ;R7RS
(utf8->string bytevector)
(utf8->string bytevector start)       ;R7RS
(utf8->string bytevector start end)   ;R7RS

;; The following procedures are in R6RS and are absent from R7RS.

(string->utf16 string)
(string->utf16 string endianness)
(string->utf32 string)
(string->utf32 string endianness)
(utf16->string bytevector endianness)
(utf16->string bytevector endianness endianness-mandatory?)
(utf32->string bytevector endianness)
(utf32->string bytevector endianness endianness-mandatory?)

DESCRIPTION

These procedures convert between string and bytevector representations of strings in various Unicode encodings.

The string->utf8, string->utf16, and string->utf32 procedures return a bytevector that contains an encoding of string (with no byte-order mark). The utf8->string, utf16->string, and utf32->string procedures return a string whose character sequence is encoded by bytevector.

utf8->string, string->utf8
These procedures use the UTF-8 encoding.
string->utf16
This procedure encodes according to UTF-16BE (default) or UTF-16LE.
string->utf32
This procedure encodes according to UTF-32BE (default) or UTF-32LE.
utf16->string
This procedure decodes according to UTF-16, UTF-16BE, UTF-16LE, or a fourth encoding scheme that differs from all of those, as in the description of endianness-mandatory? below.
utf32->string
This procedure decodes according to UTF-32, UTF-32BE, UTF-32LE, or a fourth encoding scheme that differs from all of those, as in the description of endianness-mandatory? below.
Endianness
If endianness is specified, it must be the symbol big or the symbol little. This differs from other bytevector procedures that can support additional implementation-defined endianness values. See native-endianness(3scm) for a definition of endianness. The default endianness for the string-> procedures is big. The endianness concept is not applicable to UTF-8.
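
As a small illustration of the endianness argument, the following R6RS sketch encodes the single character A (U+0041). The expected results in the comments are written out by hand from the UTF-16 and UTF-32 definitions.

```scheme
(import (rnrs))

;; The default endianness is big, and no BOM is emitted.
(string->utf16 "A")                       ; => #vu8(#x00 #x41)
(string->utf16 "A" (endianness little))   ; => #vu8(#x41 #x00)
(string->utf32 "A")                       ; => #vu8(#x00 #x00 #x00 #x41)
(string->utf32 "A" (endianness little))   ; => #vu8(#x41 #x00 #x00 #x00)
```
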
Byte-order marks
A UTF-16 BOM is either the sequence of bytes #xFE, #xFF specifying big and UTF-16BE, or #xFF, #xFE specifying little and UTF-16LE.

A UTF-32 BOM is either the sequence of bytes #x00, #x00, #xFE, #xFF specifying big and UTF-32BE, or #xFF, #xFE, #x00, #x00 specifying little and UTF-32LE.

A UTF-8 BOM is the sequence of bytes #xEF, #xBB, #xBF. Neither R6RS nor R7RS mentions the UTF-8 BOM. It does not specify an endianness, but is sometimes used as a magic string to mark UTF-8 text.

The endianness-mandatory? argument (the fourth encoding scheme)
If endianness-mandatory? is absent or #f, then utf16->string and utf32->string determine the endianness according to a BOM at the beginning of bytevector if a BOM is present; in this case, the BOM is not decoded as a character. Also in this case, if no BOM is present, endianness specifies the endianness of the encoding. If endianness-mandatory? is a true value, endianness specifies the endianness of the encoding, and any BOM in the encoding is decoded as a regular character.
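
The three cases described above can be sketched in R6RS as follows. The bytevector bv is an illustrative input, not part of any standard: it holds a little-endian BOM followed by the UTF-16LE encoding of A.

```scheme
(import (rnrs))

;; #xFF #xFE is the little-endian UTF-16 BOM; #x41 #x00 is "A" in UTF-16LE.
(define bv #vu8(#xFF #xFE #x41 #x00))

;; endianness-mandatory? absent: the BOM overrides the big argument
;; and is not decoded as a character.
(utf16->string bv (endianness big))               ; => "A"

;; endianness-mandatory? true: the endianness argument is used as given,
;; and the BOM is decoded as the character U+FEFF.
(utf16->string bv (endianness little) #t)         ; => "\xFEFF;A"

;; No BOM present: the endianness argument decides.
(utf16->string #vu8(#x00 #x41) (endianness big))  ; => "A"
```
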
Decoding errors
If an invalid or incomplete character encoding is encountered, then the replacement character U+FFFD is appended to the string being generated, an appropriate number of bytes are ignored, and decoding continues with the following bytes.
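
A minimal R6RS sketch of this behavior: the byte #xFF never occurs in well-formed UTF-8, so decoding it should produce at least one replacement character. The exact number of replacement characters is deliberately not shown, since it may vary between implementations.

```scheme
(import (rnrs))

;; #x41 is "A", #xFF is invalid in UTF-8, #x42 is "B".
(define s (utf8->string #vu8(#x41 #xFF #x42)))

;; The valid characters on either side survive.
(string-ref s 0)                                   ; => #\A
(string-ref s (- (string-length s) 1))             ; => #\B

;; The invalid byte shows up as U+FFFD somewhere in between.
(and (memv #\xFFFD (string->list s)) #t)           ; => #t
```
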
R7RS
R7RS provides two optional arguments, start and end, for restricting the transcoding operation to a part of the input. R7RS provides only the UTF-8 procedures, and an implementation may support only a small subset of characters (see the ERRORS section).
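
A short R7RS sketch of the start and end arguments. For utf8->string they index bytes of the bytevector; for string->utf8 they index characters of the string. The bytevector procedure from (scheme base) is used here instead of a literal.

```scheme
(import (scheme base))

;; Decode only bytes 1 (inclusive) to 3 (exclusive) of the encoding of "ABCD".
(utf8->string (bytevector #x41 #x42 #x43 #x44) 1 3)   ; => "BC"

;; Encode from character index 2 to the end of the string.
(string->utf8 "ABCD" 2)                                ; => #u8(#x43 #x44)
```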

IMPLEMENTATION NOTES

Chez Scheme
There is a single empty bytevector object and a single empty string object. If these are returned then they are not newly allocated. Chez Scheme removes an initial UTF-8 BOM.
Loko Scheme
Same notes as for Chez Scheme.

RETURN VALUES

Returns a single bytevector or string object. The object is newly allocated, except that an empty result need not be.

EXAMPLES

;; The #vu8() syntax is used in R6RS. R7RS uses #u8() instead.

(utf8->string #vu8(#x41))
          => "A"

(string->utf8 "λ")
          => #vu8(#xCE #xBB)
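
As a further R6RS illustration, a character outside the Basic Multilingual Plane shows how the three encodings differ. The character U+1D11E is chosen only as an example; the byte values in the comments are worked out by hand from the encoding definitions.

```scheme
(import (rnrs))

;; A one-character string holding U+1D11E.
(define clef (string (integer->char #x1D11E)))

(string->utf8 clef)    ; => #vu8(#xF0 #x9D #x84 #x9E)

;; UTF-16 uses the surrogate pair D834 DD1E, big-endian by default.
(string->utf16 clef)   ; => #vu8(#xD8 #x34 #xDD #x1E)

(string->utf32 clef)   ; => #vu8(#x00 #x01 #xD1 #x1E)
```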

APPLICATION USAGE

These procedures are used when interfacing with external systems or other processes, and wherever strings are encoded in one of the supported encodings. File operations are usually better handled with a transcoded port, except when the file structure itself is binary and only some parts represent strings.
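
The binary-file case can be sketched in R6RS as follows. The record layout (one length byte followed by that many bytes of UTF-8 text) and the name field->string are hypothetical, chosen only for illustration.

```scheme
(import (rnrs))

;; Hypothetical layout: the byte at offset holds a length, and the
;; following bytes hold that many bytes of UTF-8 text.
(define (field->string bv offset)
  (let* ((len (bytevector-u8-ref bv offset))
         (tmp (make-bytevector len)))
    ;; R6RS utf8->string has no start and end arguments,
    ;; so the slice is copied out first.
    (bytevector-copy! bv (+ offset 1) tmp 0 len)
    (utf8->string tmp)))

;; Byte 0 and byte 4 are unrelated binary data around the field.
(field->string #vu8(#x00 #x02 #xCE #xBB #xFF) 1)   ; => "λ"
```

In R7RS the copy is unnecessary, because utf8->string accepts start and end directly.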

COMPATIBILITY

The UTF-8 variants of these procedures are present in both R6RS and R7RS, but R6RS is missing the start and end arguments.

The number of bytes skipped when decoding an invalid or incomplete character differs between implementations. Relying on the precise number of bytes skipped, or the number of replacement characters used, is not portable.

Some implementations do not return newly allocated strings or bytevectors if they are empty, as they have a single copy of each.

For R7RS, also see the notes on unsupported characters in the ERRORS section below.

ERRORS

These procedures can raise exceptions with the following condition types:
&assertion (R6RS)
The wrong number of arguments was passed or an argument was outside its domain. Somewhat unusually, the endianness-mandatory? argument can be any object.
Unsupported characters (R7RS)
It is an error to pass utf8->string the UTF-8 encoding of a character that the implementation does not support. 7-bit ASCII (except #\null) must be supported. Any other character is optional and potentially an error. You can use the full-unicode feature identifier in cond-expand(7scm) to check if all of Unicode 6.0 is supported.
R7RS
The assertions described above are errors. Implementations may signal an error, extend the procedure's domain of definition to include such arguments, or fail catastrophically.
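
One possible way to act on the full-unicode feature identifier in R7RS is sketched below. The names full-unicode?, portable-input?, and safe-utf8->string are hypothetical helpers, not part of any standard, and rejecting the input is only one of several reasonable policies.

```scheme
(import (scheme base))

;; Record at expansion time whether the implementation claims full Unicode.
(cond-expand
  (full-unicode (define full-unicode? #t))
  (else (define full-unicode? #f)))

;; True if every byte is 7-bit ASCII other than NUL, which is the
;; only input R7RS requires every implementation to accept.
(define (portable-input? bv)
  (let loop ((i 0))
    (or (= i (bytevector-length bv))
        (and (< 0 (bytevector-u8-ref bv i) 128)
             (loop (+ i 1))))))

;; Decode only when the input is known to be within the supported range.
(define (safe-utf8->string bv)
  (if (or full-unicode? (portable-input? bv))
      (utf8->string bv)
      (error "safe-utf8->string: input may contain unsupported characters" bv)))

(safe-utf8->string (bytevector #x41 #x42))   ; => "AB"
```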

SEE ALSO

string->bytevector(3scm), bytevector->string(3scm), transcoded-port(3scm)

STANDARDS

R6RS, R7RS

AUTHORS

This page is part of the scheme-manpages project. It includes materials from the RnRS documents. More information can be found at https://github.com/schemedoc/manpages/.


Markup created by unroff 1.0sc, March 04, 2023.